A Matrix multiplication tutorial

About:

Matrix multiplication is a basic tool of linear algebra, and as such has numerous applications in many areas of mathematics, as well as in applied mathematics, statistics, physics, economics, and engineering. Computing matrix products is a central operation in all computational applications of linear algebra. Given these numerous applications, there exist highly optimized implementations of MM for researchers to use, closely tied to the hardware on which they run. Each hardware vendor ships its own such implementation to squeeze out every bit of performance, and performance is generally prioritized over portability and readability. The underlying hardware may be a general-purpose computing unit (CPU) or a graphics processing unit (GPU). For example, Intel provides its routines through Intel MKL and Nvidia through cuBLAS.

With this post, my intention is to understand the various optimizations that go into these vendor libraries, in the hope of identifying opportunities to apply them to other classes of algorithms too. We will write everything in the Nim language; the language itself shouldn't matter much once we start understanding the ideas behind the optimizations, but having the complete implementation in a single language also cuts through all the weird installation and dependency-mismatch issues.
I am far from an expert in the knowledge required to implement such complex optimizations, but writing more about this subject may take me one step further!

MM:

MM is generally a binary operation between two matrices, where the column count of the first matrix matches the row count of the second: A is of shape m x k, B is of shape k x n, and the result C is of shape m x n. In the most naive form, each row of C can be calculated by keeping a row of A fixed and then going column by column over B, computing the dot product for each such pair of vectors. Each element of the resulting matrix C thus requires k multiply-add operations, totaling m x n x k operations. This tells us about the pure computational cost, but data also has to be moved back and forth between main memory and the ALU, and such data-movement costs may dominate the actual runtime of an MM operation if they are not optimized.

[figure: MM]

Each element of the C matrix is the result of a dot product of a row vector from A and a column vector from B. Note that we don't need an intermediate vector/array to store the element-wise products; we can accumulate the result directly into a register.
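
As a reference point, here is a minimal sketch of this naive form in Nim, with matrices stored as flat row-major seq[float32] (A is m x k, B is k x n, C is m x n; the proc name is just illustrative):

proc naiveMatmul(A, B: seq[float32], C: var seq[float32], m, n, k: int) =
    for i in 0 ..< m:
        for j in 0 ..< n:
            var acc: float32 = 0          # accumulate the dot product in a local (register) variable
            for p in 0 ..< k:
                acc += A[i*k + p] * B[p*n + j]
            C[i*n + j] = acc              # a single store per element of C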

Accumulation:

Since addition is commutative and associative, we don't need to go fully along K in one pass. We can go up to some length, say kc, and accumulate the result calculated so far into the appropriate index of C. On the surface this may not seem important, but it becomes really useful once we understand a bit more about the cache made available by the underlying hardware.

[figure: MM]
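
A minimal sketch of this partial accumulation (same flat row-major layout as above; C must start zeroed, and for brevity kc is assumed to divide k):

proc blockedKMatmul(A, B: seq[float32], C: var seq[float32], m, n, k, kc: int) =
    for p0 in countup(0, k - 1, kc):      # walk K in chunks of length kc
        for i in 0 ..< m:
            for j in 0 ..< n:
                var acc: float32 = 0
                for p in p0 ..< p0 + kc:  # only a kc-long stretch of the dot product
                    acc += A[i*k + p] * B[p*n + j]
                C[i*n + j] += acc         # accumulate on top of earlier partial results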

Loading a value/byte from RAM into a CPU register for data manipulation generally fetches bytes from contiguous locations too. These extra bytes are then cached. The idea is that if we later access one of those memory locations, the request can be served from the cache rather than from main memory. Hardware also takes care of things like cache invalidation if some other process has written to a memory location that is already cached. Loading data from a cache is much, much faster than loading it from RAM, so it is in our favor to read as much data as possible from the cache. Note that we generally can't manipulate a cache's contents directly, and there usually exist multiple levels of caches, ranging from a private cache (L1) to L3 (which is shared among multiple cores). For now, assume a cache exists between the CPU registers and main memory, and that some extra contiguous bytes are read into it along with every load if they are not already there.

Memory/RAM can be thought of as a large flat 1D array/buffer of data, but MM is a 2D problem, so it is not possible to read a 2D submatrix directly without pulling in some extra bytes. 2D semantics are generally provided by a library like numpy/pytorch to make it easier to write code in a more abstract form; internally, the library converts 2D indices into an appropriate offset/index into the underlying buffer.
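
For example, a row-major mapping from a 2D index to a flat offset looks roughly like this (illustrative helpers, not taken from any particular library):

# element (i, j) of a row-major m x n matrix lives at offset i*n + j in the flat buffer.
proc rowMajorOffset(i, j, n: int): int =
    i * n + j

# a stride-based version covers column-major and other layouts too.
proc stridedOffset(i, j, rowStride, colStride: int): int =
    i * rowStride + j * colStride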

In the figure above, we proposed to read a kc-length subset of the zeroth row of A and a kc-length subset of the zeroth column of B, iterating until we consume the full length k.

We could instead choose a different pattern, like reading a kc-length vector from the first column of B while keeping the kc vector from A fixed. If we get lucky, this new kc-length vector is served by the cache and not by main memory; this is a fair assumption irrespective of the memory layout of the B matrix. But now we also need a new accumulator, to store the partial dot product of A's zeroth row and B's first column. Generally registers are used for accumulation, and hardware has more than one such register.

[figure: MM]

Depending on the hardware, we could use even more accumulators, which would require more kc-length vectors from consecutive columns of B. Say we do it for nc such vectors: we now need nc accumulators to hold the partial results for nc elements of a particular row of C. In this whole process we read a single kc-length vector from the zeroth row of A and kept reading nc kc-length vectors from nc different columns of B. This can be seen as blocking along the rows of A. Conversely, we can do a similar thing by blocking along the columns of B. The figure below showcases this scenario.

[figure: MM]

By choosing different values for mc/nc/kc, we change the access patterns for the data and can hopefully pick one that best fits the cache hierarchy. Also note that accumulation involves a read and a write operation, but both happen on a register. This can be seen as register blocking/tiling. Using registers to store the partial results decreases the number of store operations by a factor that depends on how we have divided A and B into submatrices, though the number of stores is still higher than in the naive form (each element of C is written once per kc block rather than just once).

We can treat nc, mc and kc as hyper-parameters whose optimum values are a function of the underlying hardware's cache sizes and policies, but which nonetheless let us tune the final algorithm without knowing exact architecture details. We can write some code based on the theory above.

let
    m = 8
    n = 2
    k = 4

    kc = 2
    mc = 4
    nc = n    # assuming no blocking along n for now.

for p in countup(0, k - 1, kc):       # blocking along K dimension for both A and B.
    for i in countup(0, m - 1, mc):   # blocking along M dimension

        # a_submatrix ->  A[i:(i+mc), p:(p+kc)]     # a matrix of shape [mc, kc]
        # b_submatrix ->  B[p:(p+kc), 0:nc]          # a matrix of shape [kc, nc]
        # c_submatrix ->  C[i:(i+mc), 0:nc]          # matrix of shape [mc, nc]

        # calculate matrix multiplication by storing (mc x nc) partial accumulations into C.
        innerKernel(a_submatrix, b_submatrix, c_submatrix)

Note that for now our code, except for innerKernel, is purely abstract, i.e. we don't do any read/write operations, and these outer loops are fully portable. innerKernel calculates the resulting C submatrix by accumulating the dot products as discussed above.

proc innerKernel(a_submatrix, b_submatrix, c_submatrix) =

    # mc: number of rows in a_submatrix
    # nc: number of cols in b_submatrix
    # kc: common dimension.

    for p in countup(0, kc - 1, 1):
        for j in countup(0, nc - 1, 1):                # walking along row p of B: consecutive elements hopefully come from cache, much faster than RAM.
            let b_temp = b_submatrix[p, j]             # store it in a register rather than reading it again and again from cache.

            for i in countup(0, mc - 1, 1):            # moving along a_submatrix rows.
                # c_submatrix[i, j] += a_submatrix[i, p] * b_submatrix[p, j]
                c_submatrix[i, j] += a_submatrix[i, p] * b_temp    # read once into a register (much faster than cache).

The routine above accumulates values into the C submatrix at every iteration, which involves a costly store operation, and does not use registers as accumulators as discussed above. Since in our example nc = 2 and mc = 4, we can rewrite the routine as shown below:

proc innerKernel(a_submatrix, b_submatrix, c_submatrix) =

    # mc: number of rows in a_submatrix
    # nc: number of cols in b_submatrix
    # kc: common dimension.

    # register accumulators for the C submatrix (declared outside the p loop,
    # so the partial results survive across iterations).
    var
        c_00: float32 = 0
        c_01: float32 = 0
        c_10: float32 = 0
        c_11: float32 = 0
        c_20: float32 = 0
        c_21: float32 = 0
        c_30: float32 = 0
        c_31: float32 = 0

    for p in countup(0, kc - 1, 1):

        # read the p-th column of a_submatrix (mc = 4 values) into 4 registers.
        var
            a_0p: float32 = a_submatrix[0, p]  # zeroth row, p-th col
            a_1p: float32 = a_submatrix[1, p]  # 1st row, p-th col
            a_2p: float32 = a_submatrix[2, p]  # 2nd row, p-th col
            a_3p: float32 = a_submatrix[3, p]  # 3rd row, p-th col

        # zeroth column of the C submatrix, mc rows.
        var b_temp = b_submatrix[p, 0]
        c_00 += a_0p * b_temp
        c_10 += a_1p * b_temp
        c_20 += a_2p * b_temp
        c_30 += a_3p * b_temp

        # 1st column of the C submatrix, mc rows.
        b_temp = b_submatrix[p, 1]
        c_01 += a_0p * b_temp
        c_11 += a_1p * b_temp
        c_21 += a_2p * b_temp
        c_31 += a_3p * b_temp

    # now store those 8 c_xx registers into RAM/main memory:
    # 8 (mc * nc) stores to RAM, instead of (mc * kc * nc).
    c_submatrix[0, 0] += c_00
    c_submatrix[1, 0] += c_10
    # ......

We can take a moment to convince ourselves that the updated version of innerKernel is technically equivalent to the earlier one. The point of the current form is to expose possible opportunities for vectorization. The majority of CPUs (Intel/ARM/AMD) ship with vector extensions that speed up various arithmetic operations. For example, Intel/AMD dub their extensions SSE/AVX/AVX2 depending on the width of the registers in these extensions, while ARM uses NEON/ASIMD to refer to theirs. These registers can hold more bits than general-purpose registers; for example, the SSE extensions provide registers that hold 128 bits of data.

With 128 bits, we can fit 4 float32 values into a single SSE register. This allows us to do an addition/multiplication on 4 operands rather than one! This paradigm is referred to as SIMD (single instruction, multiple data), and it has clear performance benefits if we can incorporate these registers into our calculations. For now, assume we have 128-bit registers.

This lets us calculate a single column (4 values) of the C submatrix in one go. (Note that not all SIMD instructions have a latency of 1 cycle!)

proc innerKernel(a_submatrix, b_submatrix, c_submatrix) =

    # mc: number of rows in a_submatrix
    # nc: number of cols in b_submatrix
    # kc: common dimension.

    var
        c_0: m128 = mm_setzero_ps()  # zeroth column of the C submatrix.
        c_1: m128 = mm_setzero_ps()  # 1st column of the C submatrix.

    for p in countup(0, kc - 1, 1):
        # load 4 float values of the p-th column of a_submatrix into a single SIMD register
        # (we assume those values lie consecutively, e.g. the block is packed column by column).
        var a_0: m128 = mm_loadu_ps(addr(a_submatrix[0, p]))

        var b_temp = b_submatrix[p, 0]
        var b_0: m128 = mm_set1_ps(b_temp)          # equivalent to broadcasting.
        c_0 = mm_add_ps(c_0, mm_mul_ps(b_0, a_0))   # zeroth column of the C submatrix.

        b_temp = b_submatrix[p, 1]
        var b_1: m128 = mm_set1_ps(b_temp)          # equivalent to broadcasting.
        c_1 = mm_add_ps(c_1, mm_mul_ps(b_1, a_0))   # first column of the C submatrix.

    # accumulate the two 4-wide columns back into C, i.e.
    # c_submatrix[:, 0] += c_0 and c_submatrix[:, 1] += c_1
    # (with mm_loadu_ps/mm_add_ps/mm_storeu_ps if the C columns are stored contiguously).

Note the assumptions here: mc = 4 made it easy to load 4 float32 values in a single go, but hard-coding mc to 4 is not very portable, and we may no longer be able to tune our algorithm for different hardware (cache sizes).

We can regain some of that flexibility by renaming innerKernel to something like addDot_4x2, since each call to this routine updates/accumulates a 4x2 submatrix of C, and then adding two new outer loops:

proc innerKernel(mc, nc, kc: int) =
    for j in countup(0, nc - 1, 2):
        # equivalent to reading a submatrix of shape [kc, 2]
        for i in countup(0, mc - 1, 4):
            # equivalent to reading a submatrix of shape [4, kc]
            addDot_4x2(....)            # also known as the microKernel.

This form still lets us find optimum values of kc/mc/nc for the underlying hardware, for example by running a grid search. Only addDot_4x2, aka the microkernel, needs to be isolated to use SIMD instructions or assembly code; the rest of the code stays very portable. The appropriate microKernel can be chosen by detecting the hardware at runtime or by specifying it in a configuration file. This is also the philosophy preached by BLIS (a high-performance BLAS library with a focus on portability): it recommends having an init function to set up the parameters that tie to the hardware and then using a common API on all platforms. The user always has the option to run a generic microkernel and can gradually update the microkernel based on the hardware's capabilities. This is really practical, as assembly code stays isolated to a single routine; except for the microkernel, all loops just provide an abstraction and hence are fully portable.
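
A sketch of what such a dispatch could look like in Nim; the kernel names, signature and init function here are hypothetical placeholders, not BLIS APIs:

# a microkernel updates a small mr x nr block of C from packed panels of A and B.
type MicroKernel = proc (kc: int; packedA, packedB: ptr float32;
                         c: ptr float32; ldc: int) {.nimcall.}

proc addDot_4x2_generic(kc: int; packedA, packedB: ptr float32;
                        c: ptr float32; ldc: int) = discard   # portable fallback (stub here)
proc addDot_4x2_sse(kc: int; packedA, packedB: ptr float32;
                    c: ptr float32; ldc: int) = discard       # hand-tuned SIMD version (stub here)

var microKernel: MicroKernel

proc initGemm(useSimd: bool) =
    # pick a kernel once at start-up, e.g. from runtime CPU detection or a config file.
    microKernel = if useSimd: addDot_4x2_sse else: addDot_4x2_generic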

We already discussed that when reading a 2D submatrix from the original matrix, even with tiling applied, the cache still fetches some bytes that remain unused during a particular microKernel call (the underlying data is kept in a 1D buffer). So the literature also suggests repacking the tiles, i.e. copying only the necessary bytes to contiguous memory locations to utilize more of the cache capacity. This should also allow us to keep bigger submatrices in cache, but it is not free, as we have to copy a significant number of bytes to a new location. Note, though, that each element of A or B is used significantly more often than it is read during the calculation of C: each element of A is used nc times and each element of B is used mc times. The repacking cost of A can be roughly assumed to be proportional to 2*mc*kc (one read and one write per element), which is much smaller than the cost of reading nc*mc*kc values.

Now the code may look like this:

proc innerKernel(mc, nc, kc: int) =
    for j in countup(0, nc - 1, 2):
        packSubMatrixB(2, kc)
        for i in countup(0, mc - 1, 4):
            packSubMatrixA(4, kc)
            addDot_4x2(....)            # also known as the microKernel.
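
As a rough illustration of what such a packing step might do, here is a sketch of a hypothetical standalone packSubMatrixA that copies an mc x kc block of a row-major A into a contiguous buffer, column by column, so the microkernel can load whole columns contiguously:

# copy the block A[i0 ..< i0+mc, p0 ..< p0+kc] of a row-major m x k matrix
# into a contiguous buffer, packed column by column.
proc packSubMatrixA(A: seq[float32], packed: var seq[float32],
                    i0, p0, mc, kc, k: int) =
    var dst = 0
    for p in 0 ..< kc:           # for each column of the block
        for i in 0 ..< mc:       # gather the mc rows of that column
            packed[dst] = A[(i0 + i) * k + (p0 + p)]
            inc dst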

At this point, we have looked at various optimizations, mostly aimed at better utilization of the cache(s) to bypass the limitations imposed by main-memory bandwidth. Choosing different values for kc/mc/nc makes it easy to tune our algorithm to better utilize the cache without knowing the exact details of cache sizes and policies!

The microkernel routine is highly architecture specific, and the user benefits from information like the number of SIMD/NEON registers and the latencies of SIMD instructions. For example, Haswell architectures are supposed to have 2 FMA (fused multiply-add) units, each with a latency of 5 cycles. Assuming a generic addition instruction takes 1 cycle to add the data in 2 registers, we only get a speed-up of 8*2 / 5 if we write our code naively. It is on the user to hide that latency by exposing instruction-level parallelism: the idea is to interleave arithmetic instructions with relatively independent instructions, like loading some data into a register, which can execute in parallel, so that the SIMD units always have data to process. This also depends on how many SIMD registers are available at once and on hardware features like the prefetcher and branch predictors. Current architecture pipelines have become insanely complex, but trying to understand them should at least expose us to some fundamental concepts!
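
A scalar sketch of that idea, assuming the compiler keeps the accumulators in registers (the same pattern applies with SIMD registers and FMAs):

# four independent accumulators: the multiply-add on acc1 does not have to wait
# for the one on acc0, so the pipeline stays busy despite the instruction latency.
proc dotUnrolled(a, b: seq[float32], n: int): float32 =
    # assumes n is a multiple of 4 for brevity
    var acc0, acc1, acc2, acc3: float32
    var i = 0
    while i < n:
        acc0 += a[i + 0] * b[i + 0]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
        acc3 += a[i + 3] * b[i + 3]
        i += 4
    result = (acc0 + acc1) + (acc2 + acc3)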

This is a lot to understand, and my basic strategy is to form a hypothesis and then see if it works. It is also not easy to always measure performance improvements accurately. I still have a lot to learn, and sometimes the best tiling or resulting schedule is unintuitive to me. But being able to experiment and having a benchmarking setup is the key to improving rapidly. Note that vendor libraries try to pick the exact microkernel for your architecture and hence include almost all possible optimizations.

Another form of optimization that remains is core-level parallelism: current architectures are generally multi-core, with independent SIMD units and L1/L2 caches. The user has to identify the opportunities to apply this parallelism, which may not be straightforward due to data dependencies and the overhead involved in scheduling threads. A shared cache is also used by all the threads, so a particular schedule may end up causing a lot of cache misses, as a different thread's access pattern may evict data that was fetched earlier. The multi-threading documentation from BLIS describes the various opportunities and ways to achieve this in greater detail.

[figure: MM]

A simpler way to use multi-threading is through routines from libraries implementing the OpenMP specification, like libgomp/Intel OMP, which allow shared-memory parallelism. I tend to use POSIX threads, which involve manual thread management but are much more portable and easier to debug. Another strategy is to use thread pools, which aim to eliminate the thread creation and destruction overhead but instead involve signals and queues to carry out tasks.
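
For illustration, a minimal sketch of block-level parallelism along the m dimension using Nim's OpenMP-backed || iterator (assumes compiling with --passC:-fopenmp --passL:-fopenmp; macroKernel is a hypothetical placeholder for the per-block routine, not something defined earlier):

# each thread handles its own mc-tall stripe of C, so threads never write to the same rows of C.
proc macroKernel(i0, rows: int) = discard   # placeholder: would compute C[i0 ..< i0+rows, :]

proc parallelOverM(m, mc: int) =
    let blocks = (m + mc - 1) div mc
    for ib in 0 || (blocks - 1):            # OpenMP "parallel for" over the row blocks
        let i0 = ib * mc
        macroKernel(i0, min(mc, m - i0))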

For a more formal treatment of matrix multiplication by partitioning of matrices, please read the classic Anatomy of High-Performance Matrix Multiplication. It also includes the cost analysis that motivates the repacking of matrices for practical use cases.

Results:

My current implementation based on the above optimizations is able to achieve 82-85% of OpenBLAS for square matrices, i.e. m = n = k. Specifically, for 1024 x 1024 (float32), OpenBLAS takes about 7-8 ms while my implementation takes around 8.5-10 ms. For 1920 x 1920, mine takes 53-55 ms while OpenBLAS takes around 44-46 ms. I will run more tests on big/fat and thin/tall matrices once I add some sanity checks. The implementation is written completely in Nim and is inspired by BLIS. (All experiments were run on an Intel(R) Core(TM) i5-8300H CPU @ 2.30GHz, on Windows with GCC 11.1.0.)

I am not using any tiling along the n dimension, and parallel threads are used only along the m dimension. This leads to a fork-join pattern along the k dimension. In the future I will try to use thread pools to divide the work more granularly, but that is a more involved affair in itself!

Why:

These highly optimized BLAS routines may seem more like art than code, and only by being a dedicated apprentice can you notice the threads holding them together. In my opinion, it is never a waste of time to try to learn a topic at a deeper level. You can surely argue that we should let the compiler do the optimizations for us by writing the code in a certain way, but that will take you only so far, as tools work best for users who know when to distrust them!

But being so optimized also leads to brittleness, as such routines don't work across function boundaries. For example, doing a naive relu(tensor/array) after an optimized matrix multiplication could squander the benefits gained from the optimized operation. The library either needs to implement all such possible fusion routines or needs someone to modify the existing microKernel for a particular fused operation. You could still argue for doing something like store(mem_location, relu(x)): it should work, assuming an operation on register data is an almost free op. That is a fair assumption, but some operations like exponential, as provided by glibc, are really slow; you could write a much faster implementation for small input values to exp. (Most deep networks have intermediate tensors with values close to 1, so this is a fair assumption.)
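
For example, applying the ReLU while the result is still in a SIMD register is a tiny change to the microkernel's store step; a sketch using the same SSE-style intrinsic names as the kernel above (mm_max_ps is the SSE max intrinsic; storeRelu is just an illustrative name):

# apply relu to a 4-wide accumulator and write it out: one extra register
# instruction instead of a whole extra pass over memory.
proc storeRelu(dst: ptr float32, acc: m128) =
    let fused = mm_max_ps(acc, mm_setzero_ps())   # relu while the data is still in a register
    mm_storeu_ps(dst, fused)                      # single store to main memory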

Nvidia provides highly optimized implementations of algorithms through cuBLAS, cuDNN, cuRAND, etc. But such implementations are generally shipped as binary blobs and tie very specifically to the family of GPU on which they run. I remember using cuDNN for a convolution operation a few years ago; it didn't have support for passing a bias as an argument, so adding the bias was more costly than the convolution itself. The bias had to be added with a separate binary addition operation, which included the cost of moving data to and from memory. I am sure cuDNN has evolved quite a lot and supports a much larger number of primitives than it did a few years back.

My point is that we can always benefit from domain information, but to leverage it we need to be able to understand code at a deeper level than what a compiler frontend exposes!

Remarks:

The intention of writing this in a higher-level language than C and assembly is to make the code much easier to read and modify. Behind the scenes we are wrapping various SIMD/NEON intrinsics, so the low-level code is still there, but it is now much easier to hack on. These libraries also involve complex tool-chains for legacy reasons, which is a much bigger overhead than modifying and understanding the code. Writing everything in a single language makes it much easier to ship and distribute a package with more deterministic build-time behavior.

Languages like Nim, Rust, Zig etc. ship with standard build tooling, which makes it easy to build and run source code and do away with extra infrastructure headaches that bring no apparent benefit.

I didn't discuss the actual translation of 2D indices to offsets into the underlying 1D array, or memory layouts for matrices. Once we start using optimum partitions and repacking, a particular layout doesn't matter much, and the routine becomes flexible enough to accept the layout as an argument, unlike some APIs focused on a particular memory layout of matrices.

Mratsim has also implemented a BLAS backend in pure Nim, which can generate and dispatch kernels more elegantly using macros. Feel free to visit the repo for benchmarks and the optimizations behind a practical BLAS implementation.

I have tried to discuss some of the optimizations involved in vendor-specific BLAS routines, but it is quite an involved topic and I may have some misconceptions in my understanding! Please feel free to provide any kind of feedback.

Resources:


If you have found this content useful, please consider donating. It would help me pay my bills and create more of this.